Extend Nemo AutoTokenizer & SentencePieceTokenizer API for TensorRT-LLM & AMMO evaluation scripts usage #8818

janekl · 2024-04-04T10:50:28Z

What does this PR do?

Extends NeMo's AutoTokenizer and SentencePieceTokenizer for consistency with the HuggingFace tokenizers API, which is used throughout the TensorRT-LLM and AMMO evaluation tools.

This is necessary to seamlessly enable model validation -- including quantized variants -- with the current scripts and to avoid code duplication.

Example use-cases to support:

overall API: link1
encode: link1 and link2
decode: link1
batch_decode: link1
batch_encode_plus: link2
setting pad_token_id: link1
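
To make the target call pattern concrete, here is a minimal, hypothetical sketch of the HuggingFace-style surface that the use cases above exercise. `ToyTokenizer` and its whitespace vocabulary are illustrative assumptions, not NeMo or HuggingFace code; only the method names (`encode`, `decode`, `batch_decode`, `batch_encode_plus`) and the `pad_token_id` attribute are taken from the list.

```python
from typing import Dict, List, Optional


class ToyTokenizer:
    """Hypothetical whitespace tokenizer exposing HF-style method names."""

    def __init__(self, vocab: List[str]) -> None:
        # Assumption: token ids are simply positions in the vocab list.
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        # Evaluation scripts may assign this attribute after construction.
        self.pad_token_id: Optional[int] = None

    def encode(self, text: str) -> List[int]:
        return [self.token_to_id[tok] for tok in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)

    def batch_decode(self, batch_ids: List[List[int]]) -> List[str]:
        return [self.decode(ids) for ids in batch_ids]

    def batch_encode_plus(self, texts: List[str]) -> Dict[str, List[List[int]]]:
        # HF returns a mapping that contains "input_ids"; this sketch returns only that key.
        return {"input_ids": [self.encode(t) for t in texts]}


tok = ToyTokenizer(["<pad>", "hello", "world"])
tok.pad_token_id = 0
ids = tok.encode("hello world")
batch = tok.batch_encode_plus(["hello", "world hello"])
texts = tok.batch_decode(batch["input_ids"])
```

A script written against this surface does not need to know which tokenizer backend it was handed, which is the duplication the PR aims to avoid.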

Collection: NLP

Changelog

Add different variants of the encode/decode methods (see the use cases listed above)
Add pad_token_id (with a setter enabled for the HF AutoTokenizer wrapper) and eos_token_id properties, for HF tokenizer compatibility
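
A minimal sketch of how the property approach described in the changelog could look, assuming a wrapper that keeps the HF tokenizer in an attribute. `WrappedTokenizer`, `_FakeHFTokenizer`, and the attribute name `tokenizer` are hypothetical stand-ins; the actual NeMo implementation may differ.

```python
from typing import Optional


class _FakeHFTokenizer:
    """Stand-in for a HuggingFace tokenizer (assumption: only two attributes matter here)."""

    def __init__(self) -> None:
        self.pad_token_id: Optional[int] = None
        self.eos_token_id: Optional[int] = 2


class WrappedTokenizer:
    """Hypothetical wrapper that forwards token-id attributes to the inner tokenizer."""

    def __init__(self, inner: _FakeHFTokenizer) -> None:
        self.tokenizer = inner

    @property
    def pad_token_id(self) -> Optional[int]:
        return self.tokenizer.pad_token_id

    @pad_token_id.setter
    def pad_token_id(self, value: Optional[int]) -> None:
        # Writes go to the wrapped tokenizer so both objects stay in sync.
        self.tokenizer.pad_token_id = value

    @property
    def eos_token_id(self) -> Optional[int]:
        # Read-only in this sketch: no setter is defined for eos_token_id.
        return self.tokenizer.eos_token_id


wrapped = WrappedTokenizer(_FakeHFTokenizer())
# Pattern often seen in HF-based scripts: fall back to EOS when no pad token is set.
if wrapped.pad_token_id is None:
    wrapped.pad_token_id = wrapped.eos_token_id
```

Forwarding the setter to the inner object means a script that assigns `pad_token_id` on the wrapper also changes what the wrapped tokenizer uses for padding.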

Usage

See examples listed above.

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

janekl · 2024-04-04T10:57:39Z

jenkins

janekl · 2024-04-12T12:46:20Z

jenkins


aklife97 · 2024-04-22T19:41:06Z

LGTM, but before we merge, can we check whether the AutoTokenizer and local SentencePiece outputs match for the newly implemented methods?

It is probably fine if they are not identical, but we need to know where the differences arise and document any that exist.
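
One way such a check could be run is sketched below, under stated assumptions: `compare_tokenizers` is a hypothetical helper, and `encode_auto` / `encode_spm` are toy stand-ins for the two tokenizers' `encode` methods (here they differ only in whether a BOS id is prepended). Nothing in this block is taken from the PR itself.

```python
from typing import Callable, Dict, List


def compare_tokenizers(
    encoders: Dict[str, Callable[[str], List[int]]],
    texts: List[str],
) -> List[dict]:
    """Return one record per text on which the encoders disagree."""
    mismatches = []
    for text in texts:
        outputs = {name: encode(text) for name, encode in encoders.items()}
        # More than one distinct id sequence means the encoders differ on this text.
        if len({tuple(ids) for ids in outputs.values()}) > 1:
            mismatches.append({"text": text, "outputs": outputs})
    return mismatches


def encode_spm(text: str) -> List[int]:
    # Toy encoder: one id per whitespace-separated word.
    return [len(word) + 10 for word in text.split()]


def encode_auto(text: str) -> List[int]:
    # Toy encoder: same ids, but prepends a BOS id (1) to non-empty input.
    ids = encode_spm(text)
    return [1] + ids if ids else ids


diffs = compare_tokenizers(
    {"auto": encode_auto, "spm": encode_spm},
    ["", "hello world"],
)
for record in diffs:
    print(record["text"], record["outputs"])
```

With real tokenizers, the bound `encode` methods of the two wrappers could be passed in the same way, and the same loop could be repeated for `decode` and `batch_decode` to collect the differences that need documenting.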

github-actions · 2024-05-08T01:39:50Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions · 2024-05-23T01:45:54Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions · 2024-05-30T01:47:01Z

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot added the common label Apr 4, 2024

janekl requested review from suiyoubi and titu1994 April 4, 2024 10:57

janekl force-pushed the jlasek/nemo_autotokenizer_extensions branch from 0251e96 to 4abe31d Compare April 12, 2024 12:28

janekl changed the title ~~Extend Nemo AutoTokenizer API for TensorRT-LLM & AMMO evaluation scripts usage~~ Extend Nemo AutoTokenizer & SentencePieceTokenizer API for TensorRT-LLM & AMMO evaluation scripts usage Apr 12, 2024

ericharper requested a review from aklife97 April 17, 2024 15:24

janekl force-pushed the jlasek/nemo_autotokenizer_extensions branch from 64350e1 to 708cde7 Compare April 18, 2024 11:19

janekl and others added 8 commits April 18, 2024 13:21

Extend Nemo AutoTokenizer for TRT-LLM evaluation usage

f4cef51

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

Set pad_token_id correctly

f5f5341

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

Extend TokenizerSpec for HF compliance

f14e8d5

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

Extend SentencePieceTokenizer for consistency with HuggingFace tokenizers

85e5098

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

Typing for pad_token_id.setter

0f5d2b2

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

Don't expose pad_token and eos_token

3489364

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8a4e7d

for more information, see https://pre-commit.ci

Add test for SentencePieceTokenizer extensions

4c2cc81

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

janekl force-pushed the jlasek/nemo_autotokenizer_extensions branch from 708cde7 to 4c2cc81 Compare April 18, 2024 11:22

github-advanced-security bot found potential problems Apr 18, 2024

nemo/collections/common/tokenizers/tokenizer_spec.py: Dismissed

github-actions bot added the stale label May 8, 2024

janekl removed the stale label May 8, 2024

github-actions bot added the stale label May 23, 2024

github-actions bot closed this May 30, 2024
